Learning to see across domains and modalities
Deep learning has recently raised hopes and expectations as a general solution for many applications (computer vision, natural language processing, speech recognition, etc.); it has indeed proven effective, but it has also shown a strong dependence on large quantities of data. Generally speaking, deep learning models are especially susceptible to overfitting because of their large number of internal parameters.
Luckily, it has also been shown that, even when data is scarce, a successful model can be trained by reusing prior knowledge. Developing techniques for transfer learning, as this process is known, is thus, in its broadest definition, a crucial element towards deploying effective and accurate intelligent systems in the real world.
This thesis will focus on a family of transfer learning methods applied to the task of visual object recognition, specifically image classification. The visual recognition problem is central to computer vision research: many desired applications, from robotics to information retrieval, demand the ability to correctly identify categories, places, and objects.
Transfer learning is a general term, and specific settings have been given specific names: when the learner has access to only unlabeled data from the target domain (where the model should perform) and labeled data from a different domain (the source), the problem is called unsupervised domain adaptation (DA). The first part of this thesis will focus on three methods for this setting.
The three techniques presented for domain adaptation are fully distinct. The first proposes Domain Alignment layers that structurally align the distributions of the source and target domains in feature space; while the general idea of aligning feature distributions is not novel, our method is one of the very few that does so without adding loss terms. The second method is based on GANs: we propose a bidirectional architecture that jointly learns how to map source images into the target visual style and vice versa, thus alleviating the domain shift at the pixel level. The third method features an adversarial learning process that transforms both the images and the features of the two domains in order to map them to a common, domain-agnostic space.
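A minimal sketch of the adversarial mechanism behind the third method, assuming a DANN-style gradient reversal layer (the toy discriminator and feature sizes below are illustrative, not the thesis implementation): the forward pass is the identity, while the backward pass flips the gradient so the feature extractor learns to fool a domain discriminator.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Toy domain discriminator: features pushed through the reversal layer are
# trained to be indistinguishable across domains, i.e. mapped to a common space.
discriminator = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))
features = torch.randn(8, 256, requires_grad=True)  # stand-in for CNN features
domain_logits = discriminator(grad_reverse(features))
```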
While the first part of the thesis presents general-purpose DA methods, the second part focuses on the real-life challenges of robotic perception, specifically RGB-D recognition.
Robotic platforms are usually not limited to color perception; very often they also carry a depth camera.
Unfortunately, the depth modality is rarely used for visual recognition, due to the lack of pretrained models from which to transfer and the scarcity of data on which to train one from scratch.
We first explore the use of synthetic data as a proxy for real images by training a Convolutional Neural Network (CNN) on virtual depth maps, rendered from 3D CAD models, and then testing it on real robotic datasets. The second approach leverages the existence of RGB pretrained models by learning how to map depth data into its most discriminative RGB representation and then using those models for recognition. This second technique is in fact a generic transfer learning method that can be applied to share knowledge across modalities.
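A hedged sketch of this cross-modal idea, under simplifying assumptions: a small trainable network turns a 1-channel depth map into a 3-channel image that a frozen, RGB-pretrained CNN finds discriminative, so only the mapper and a new classification head receive gradients. The architecture and sizes below are illustrative, not the exact method of the thesis.

```python
import torch
import torch.nn as nn
from torchvision import models

mapper = nn.Sequential(              # depth (1 channel) -> pseudo-RGB (3 channels)
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
)
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in backbone.parameters():      # reuse RGB knowledge: keep the backbone frozen
    p.requires_grad = False
backbone.fc = nn.Linear(512, 10)     # new, trainable head for the target categories

depth = torch.randn(4, 1, 224, 224)  # stand-in for a batch of depth maps
logits = backbone(mapper(depth))     # gradients flow only into mapper and head
```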
A deep representation for depth images from synthetic data
Convolutional Neural Networks (CNNs) trained on large-scale RGB databases have become the secret sauce in the majority of recent approaches for object categorization from RGB-D data. Thanks to colorization techniques, these methods exploit the filters learned from 2D images to extract meaningful representations in 2.5D. Still, the perceptual signatures of these two kinds of images are very different, with the first usually strongly characterized by textures and the second mostly by the silhouettes of objects. Ideally, one would like to have two CNNs, one for RGB and one for depth, each trained on a suitable data collection and able to capture the perceptual properties of its channel for the task at hand. This has not been possible so far, due to the lack of a suitable depth database. This paper addresses this issue, proposing to opt for synthetically generated images rather than hand-collecting a large-scale 2.5D database. While clearly a proxy for real data, synthetic images allow quality to be traded for quantity, making it possible to generate a virtually infinite amount of data. We show that the very same architecture typically used on visual data, when trained on such a collection, learns very different filters, resulting in depth features that are (a) better able to characterize the different facets of depth images and (b) complementary to those derived from CNNs pre-trained on 2D datasets. Experiments on two publicly available databases show the power of our approach.
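The colorization step alluded to above can be illustrated in a few lines: normalize the raw depth values and expand them to three channels through a colormap, so that filters learned on RGB images can be applied to 2.5D data. The jet colormap is an assumption made for illustration; several depth encodings exist in the literature.

```python
import numpy as np
import matplotlib.cm as cm

def colorize_depth(depth: np.ndarray) -> np.ndarray:
    """Map an HxW depth image to an HxWx3 uint8 pseudo-color image."""
    d = depth.astype(np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)  # normalize to [0, 1]
    rgb = cm.jet(d)[..., :3]                        # apply colormap, drop alpha
    return (rgb * 255).astype(np.uint8)

fake_depth = np.random.rand(240, 320)               # stand-in for a rendered map
print(colorize_depth(fake_depth).shape)             # (240, 320, 3)
```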
From source to target and back: symmetric bi-directional adaptive GAN
The effectiveness of generative adversarial approaches in producing images according to a specific style or visual domain has recently opened new directions for solving the unsupervised domain adaptation problem. It has been shown that labeled source images can be modified to mimic target samples, making it possible to directly train a classifier in the target domain despite the original lack of annotated data. Inverse mappings from the target to the source domain have also been evaluated, but only by passing through adapted feature spaces, thus without generating new images. In this paper we propose to better exploit the potential of generative adversarial networks for adaptation by introducing a novel symmetric mapping between domains. We jointly optimize bi-directional image transformations, combining them with target self-labeling. Moreover, we define a new class consistency loss that aligns the generators in the two directions, imposing that an image conserve its class identity when passing through both domain mappings. A detailed qualitative and quantitative analysis of the reconstructed images confirms the power of our approach. By integrating the two domain-specific classifiers obtained with our bi-directional network, we exceed previous state-of-the-art unsupervised adaptation results on four different benchmark datasets.
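A minimal sketch of the class consistency idea, with G_st, G_ts and C_src as hypothetical stand-ins for the paper's two generators and source classifier: an image mapped to the target style and back should still be recognized with its original label.

```python
import torch.nn.functional as F

def class_consistency_loss(x_src, y_src, G_st, G_ts, C_src):
    """Cross-entropy on the source classifier after a full round trip."""
    x_round_trip = G_ts(G_st(x_src))       # source -> target style -> back to source
    logits = C_src(x_round_trip)
    return F.cross_entropy(logits, y_src)  # the round trip must conserve the class
```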
Bridging Between Computer and Robot Vision Through Data Augmentation: A Case Study on Object Recognition
Despite the impressive progress brought by deep networks in visual object recognition, robot vision is still far from being a solved problem. The most successful convolutional architectures are developed starting from ImageNet, a large-scale collection of images of object categories downloaded from the Web. These images are very different from the situated and embodied visual experience of robots deployed in unconstrained settings. To reduce the gap between these two visual experiences, this paper proposes a simple yet effective data augmentation layer that zooms in on the object of interest and simulates the object detection outcome of a robot vision system. The layer, which can be used with any convolutional deep architecture, yields an increase in object recognition performance of up to 7%, in experiments performed over three different benchmark databases. An implementation of our robot data augmentation layer has been made publicly available.
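One plausible reading of such a layer, as a sketch: randomly crop a central region of each image (assuming the object of interest is roughly centered, as it would be after a detection crop) and resize it back to the input resolution. The scale range is an illustrative choice, not the paper's exact parameterization.

```python
import torch
import torch.nn.functional as F

def zoom_augment(batch: torch.Tensor, min_scale: float = 0.6) -> torch.Tensor:
    """Randomly zoom in on the center of each image in an NxCxHxW batch."""
    n, c, h, w = batch.shape
    out = []
    for img in batch:
        s = torch.empty(1).uniform_(min_scale, 1.0).item()  # random zoom factor
        ch, cw = int(h * s), int(w * s)
        top, left = (h - ch) // 2, (w - cw) // 2
        crop = img[:, top:top + ch, left:left + cw].unsqueeze(0)
        out.append(F.interpolate(crop, size=(h, w), mode="bilinear",
                                 align_corners=False))
    return torch.cat(out, dim=0)
```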
Domain Generalization by Solving Jigsaw Puzzles
Human adaptability relies crucially on the ability to learn and merge knowledge from both supervised and unsupervised learning: parents point out a few important concepts, but then children fill in the gaps on their own. This is particularly effective because supervised learning can never be exhaustive, and learning autonomously thus makes it possible to discover invariances and regularities that help generalization. In this paper we propose to apply a similar approach to the task of object recognition across domains: our model learns the semantic labels in a supervised fashion and broadens its understanding of the data by learning from self-supervised signals how to solve a jigsaw puzzle on the same images. This secondary task helps the network learn concepts of spatial correlation while acting as a regularizer for the classification task. Multiple experiments on the PACS, VLCS, Office-Home and digits datasets confirm our intuition and show that this simple method outperforms previous domain generalization and adaptation solutions. An ablation study further illustrates the inner workings of our approach. (Accepted at CVPR 2019, oral.)
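A hedged sketch of the self-supervised signal: split an image into a 3x3 grid of tiles, shuffle the tiles according to a permutation drawn from a fixed set, and let an auxiliary head recover the permutation index. The random permutation set below is a simplification; the paper selects a set of maximally distinct permutations.

```python
import itertools
import random
import torch

PERMUTATIONS = random.sample(list(itertools.permutations(range(9))), 30)

def make_jigsaw(img: torch.Tensor):
    """img: CxHxW with H and W divisible by 3. Returns (shuffled image, label)."""
    c, h, w = img.shape
    th, tw = h // 3, w // 3
    tiles = [img[:, i*th:(i+1)*th, j*tw:(j+1)*tw]
             for i in range(3) for j in range(3)]
    label = random.randrange(len(PERMUTATIONS))     # index of the drawn permutation
    perm = PERMUTATIONS[label]
    rows = [torch.cat([tiles[perm[r*3 + col]] for col in range(3)], dim=2)
            for r in range(3)]
    return torch.cat(rows, dim=1), label            # label supervises the jigsaw head
```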
AutoDIAL: Automatic DomaIn Alignment Layers
Classifiers trained on given databases perform poorly when tested on data acquired in different settings. Domain adaptation explains this through a shift between the distributions of the source and target domains. Attempts to align them have traditionally resulted in works that reduce the domain shift by introducing into the objective function appropriate loss terms measuring the discrepancy between source and target distributions. Here we take a different route, proposing to align the learned representations by embedding, in any given network, specific Domain Alignment Layers designed to match the source and target feature distributions to a reference one. Unlike previous works, which define a priori in which layers adaptation should be performed, our method is able to automatically learn the degree of feature alignment required at different levels of the deep network. Thorough experiments on different public benchmarks, in the unsupervised setting, confirm the power of our approach.
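A minimal sketch in the spirit of these Domain Alignment Layers, under simplifying assumptions: source and target activations are normalized with cross-domain statistics mixed by a learnable alpha, so the degree of alignment at each layer is learned rather than fixed a priori (affine parameters and running statistics are omitted for brevity).

```python
import torch
import torch.nn as nn

class DALayer(nn.Module):
    def __init__(self, eps: float = 1e-5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.75))  # learnable degree of mixing
        self.eps = eps

    def forward(self, x_src: torch.Tensor, x_tgt: torch.Tensor):
        a = self.alpha.clamp(0.5, 1.0)  # between per-domain and fully shared stats
        def stats(x):  # per-channel batch statistics for NxCxHxW activations
            return (x.mean(dim=(0, 2, 3), keepdim=True),
                    x.var(dim=(0, 2, 3), keepdim=True))
        mu_s, var_s = stats(x_src)
        mu_t, var_t = stats(x_tgt)
        # each domain is normalized with statistics partly borrowed from the other
        x_src = (x_src - (a * mu_s + (1 - a) * mu_t)) \
            / torch.sqrt(a * var_s + (1 - a) * var_t + self.eps)
        x_tgt = (x_tgt - (a * mu_t + (1 - a) * mu_s)) \
            / torch.sqrt(a * var_t + (1 - a) * var_s + self.eps)
        return x_src, x_tgt
```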
Adversarial Branch Architecture Search for Unsupervised Domain Adaptation
Unsupervised Domain Adaptation (UDA) is a key issue in visual recognition, as it makes it possible to bridge different visual domains, enabling robust performance in the real world. To date, all proposed approaches rely on human expertise to manually adapt a given UDA method (e.g. DANN) to a specific backbone architecture (e.g. ResNet). This dependency on handcrafted designs limits the applicability of a given approach over time, as old methods constantly need to be adapted to novel backbones.
Existing Neural Architecture Search (NAS) approaches cannot be directly applied to mitigate this issue, as they rely on labels that are not available in the UDA setting. Furthermore, most NAS methods search for full architectures, which precludes the use of pre-trained models, essential in a vast range of UDA settings for reaching state-of-the-art results. To the best of our knowledge, no prior work has addressed these aspects in the context of NAS for UDA. Here we tackle both aspects with an Adversarial Branch Architecture Search for UDA (ABAS): i. we address the lack of target labels with a novel data-driven ensemble approach for model selection; and ii. we search for an auxiliary adversarial branch, attached to a pre-trained backbone, which drives the domain alignment.
We extensively validate ABAS by improving two modern UDA techniques, DANN and ALDA, on three standard visual recognition datasets (Office31, Office-Home and PACS). In all cases, ABAS robustly finds the adversarial branch architectures and parameters that yield the best performance. (Accepted at WACV 2022.)
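A hedged sketch of the searched component: an auxiliary adversarial branch of configurable depth and width attached to the features of a pre-trained backbone. The random sampling below merely stands in for the paper's search and label-free model selection, and all names are illustrative.

```python
import random
import torch.nn as nn

def build_adversarial_branch(in_features: int, depth: int, width: int) -> nn.Module:
    """Domain discriminator branch; depth and width are the searched parameters."""
    layers, d = [], in_features
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, 2))  # source-vs-target logits
    return nn.Sequential(*layers)

# One step of (random) architecture sampling over the branch search space:
candidate = build_adversarial_branch(in_features=2048,  # e.g. ResNet50 features
                                     depth=random.randint(1, 3),
                                     width=random.choice([256, 512, 1024]))
```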
MANAS: Multi-Agent Neural Architecture Search
The Neural Architecture Search (NAS) problem is typically formulated as a graph search problem, where the goal is to learn the optimal operations over edges in order to maximise a graph-level global objective. Due to the large architecture parameter space, efficiency is a key bottleneck preventing the practical use of NAS. In this paper we address the issue by framing NAS as a multi-agent problem, where agents control a subset of the network and coordinate to reach optimal architectures. We provide two distinct lightweight implementations with reduced memory requirements (1/8th of state-of-the-art) and performance above that of much more computationally expensive methods. Theoretically, we demonstrate vanishing regrets of the form O(sqrt(T)), with T being the total number of rounds. Finally, aware that random search is an often ignored yet effective baseline, we perform additional experiments on 3 alternative datasets and 2 network configurations, and achieve favourable results in comparison.
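A minimal sketch of the multi-agent view, assuming an EXP3-style bandit per edge (an illustrative choice, not MANAS's exact update): each agent samples one operation for its edge, and all agents update their weights from a shared, graph-level reward such as validation accuracy.

```python
import math
import random

class EdgeAgent:
    """Bandit controlling the operation choice on a single edge of the graph."""
    def __init__(self, ops, gamma=0.1):
        self.ops, self.gamma = ops, gamma
        self.weights = [1.0] * len(ops)

    def probs(self):
        total, k = sum(self.weights), len(self.ops)
        return [(1 - self.gamma) * w / total + self.gamma / k for w in self.weights]

    def sample(self):
        self.last = random.choices(range(len(self.ops)), weights=self.probs())[0]
        return self.ops[self.last]

    def update(self, reward):  # importance-weighted, EXP3-style update
        p = self.probs()[self.last]
        self.weights[self.last] *= math.exp(self.gamma * reward / (p * len(self.ops)))

agents = [EdgeAgent(["conv3x3", "conv5x5", "skip", "maxpool"]) for _ in range(8)]
arch = [agent.sample() for agent in agents]  # one jointly sampled architecture
for agent in agents:
    agent.update(reward=0.73)                # e.g. validation accuracy of `arch`
```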